We analyze control of the familywise error rate (FWER) in a multiple
testing scenario with a great many null hypotheses about the
distribution of a high-dimensional random variable among which only
a very small fraction are false, or “active”. In order to improve power
relative to conservative Bonferroni bounds, we explore a coarse-to-fine
procedure adapted to a situation in which tests are partitioned into
subsets, or “cells”, and active hypotheses tend to cluster within cells.

We develop procedures for a standard linear model with Gaussian
data and a non-parametric case based on generalized permutation
testing, and demonstrate considerably higher power than Bonferroni
estimates at the same FWER when the active hypotheses do cluster.
The main technical difficulty arises from the correlation between
the test statistics at the individual and cell levels, which increases
the likelihood of a hypothesis being falsely discovered when the cell
that contains it is falsely discovered (survivorship bias). This requires
sharp estimates of certain quadrant probabilities when a cell is inactive.


1. Introduction. We consider a multiple testing scenario encountered
in many current applications of statistics. Given a large index set $V$ and a
family $\{H_0(v), v \in V\}$ of null hypotheses about the distribution of a
high-dimensional random vector $U \in \mathbb{R}^d$, we wish to design a procedure,
basically a family of test statistics and thresholds, to estimate the subset
$A \subset V$ over which the null hypotheses are false. We shall refer to $A$ as
the “active set” and write $\hat{A} = \hat{A}(\mathbf{U})$ for our estimator of $A$
based on a random sample $\mathbf{U}$ of size $n$ from $U$. The hypotheses in
$\hat{A}(\mathbf{U})$ (namely the ones for which the null is rejected) are referred
to as “detections” or “discoveries.” Naturally, the goal is to maximize the number
$|A \cap \hat{A}(\mathbf{U})|$ of detected true positives while simultaneously
controlling the number $|A^c \cap \hat{A}(\mathbf{U})|$ of false discoveries.

There are two widely used criteria for controlling false positives: 
FWER: Assume that $\mathbf{U}$ is defined on the probability space $(\Omega, \mathbb{P})$. The
family-wise error rate (FWER) is

$$\mathrm{FWER}(\hat{A}) = \mathbb{P}\big(\hat{A}(\mathbf{U}) \cap A^c \neq \emptyset\big),$$

which is the probability of making at least one false discovery. This is usually
controlled using Bonferroni bounds and their refinements [9, 13, 11, 10], or
using resampling methods or random permutation.

FDR: The false discovery rate (FDR) is the expected ratio between the
number of false alarms $|A^c \cap \hat{A}(\mathbf{U})|$ and the number of
discoveries $|\hat{A}(\mathbf{U})|$ [3, 5, 4].

In many cases, including the settings in computational biology which
directly motivate this work, we find $|A| \ll |V|$, $n \ll d$ as well as small “effect
sizes.” This is the case, for example, in genome-wide association studies
(GWAS) where $U = (Y, X_v, v \in V)$ and the dependence of the “phenotype”
$Y$ on the “genotype” $(X_v, v \in V)$ is often assumed to be linear; the active
set $A$ consists of those $v$ with non-zero coefficients and effect size refers to the
fraction of the total variance of $Y$ explained by a particular $X_v$. Under these
challenging circumstances, the FWER criterion is usually very conservative
and power is limited; that is, the number of true positive detections is often
very small (if not null) compared to $|A|$ (the “missing heritability”). This
is why the less conservative FDR criterion is sometimes preferred: it allows
for a higher number of true detections, but of course at the expense of false
positives. However, there are situations, such as GWAS, in which this tradeoff
is unacceptable; for example, collecting more data and doing follow-up
experiments may be too labor intensive or expensive, and therefore having
even one false discovery may be deemed undesirable.

To set the stage for our proposal, suppose we are given a family
$T_v = T_v(\mathbf{U}), v \in V$, of test statistics and can assume that deviations from the
null are captured by small values of $T_v(\mathbf{U})$ (e.g., p-values). We make the
usual assumption, easily achieved in practice, that the distribution of $T_v(\mathbf{U})$
does not depend on $v$ when $v \in A^c$, and individual rejection regions are of
the form $\{\mathbf{u} \in \mathcal{U} : T_v(\mathbf{u}) \le \theta\}$ for a constant $\theta$ independent of $v$. Defining
$\hat{A}(\mathbf{U}) = \{v : T_v(\mathbf{U}) \le \theta\}$, the Bonferroni upper-bound is

$$\mathrm{FWER} \le \sum_{v \in A^c} \mathbb{P}\big(T_v(\mathbf{U}) \le \theta\big) \le |V| \max_{v \in A^c} \mathbb{P}\big(T_v(\mathbf{U}) \le \theta\big).$$

To ensure that $\mathrm{FWER} \le \alpha$, $\theta = \theta_B$ is selected such that
$\mathbb{P}(T_v(\mathbf{U}) \le \theta_B) \le \alpha/|V|$ whenever $v \in A^c$. The Bonferroni bound can only be
marginally improved in the general case (see, in particular, the estimator of [13],
which will be referred to as Bonferroni-Holm in the rest of the paper). While
alternative procedures (including permutation tests) can be designed to take
advantage of correlations among tests, the bound is sharp when $|V| \gg |A|$
and tests are independent.

COARSE-TO-FINE MULTIPLE TESTING STRATEGIES

Coarse-to-fine Testing: Clearly some additional assumptions or domain-specific
knowledge is necessary to ameliorate the reduction in power resulting
from controlling the FWER. Motivated by applications in genomics, we suppose
the set $V$ has a natural hierarchical structure. In principle, it should
then be possible to gain power if the active hypotheses are not randomly
distributed throughout $V$ but rather have a tendency to cluster within cells
of the hierarchy. In fact, we shall consider the simplest example consisting
of only two levels corresponding to individual hypotheses indexed by $v \in V$
and a partition of $V$ into non-overlapping subsets $\{g \subset V, g \in G\}$, which we
call “cells.” We will propose a particular multiple testing strategy which is
coarse-to-fine with respect to this structure, controls the FWER, and whose
power will exceed that of the standard Bonferroni-Holm approach for typical
models and realistic parameters when a minimal degree of clustering is
present. It is important to note that the clustering property is not a condition
for a correct control of the FWER at a given level using our coarse-to-fine
procedure, but only for its increased efficiency in discovering active hypotheses.

Our estimate of $A$ is now based on two families of test statistics:
$\{T_v(\mathbf{U}), v \in V\}$, as above, and $\{T_g(\mathbf{U}), g \in G\}$. The cell-level test $T_g$ is
designed to assume small values only when $g$ is “active,” meaning that
$g \cap A \neq \emptyset$. Our estimator of $A$ is now

$$\hat{A}(\mathbf{U}) = \{v : T_{g(v)}(\mathbf{U}) \le \theta_G,\ T_v(\mathbf{U}) \le \theta_V\}.$$

One theoretical challenge of this method is to derive a tractable method for
controlling the FWER at a given level $\alpha$. Evidently, this method can only
out-perform Bonferroni if $\theta_V > \theta_B$; otherwise, the coarse-to-fine active set
is a subset of the Bonferroni discoveries. A key parameter is $J$, an upper
bound on the number of active cells, and in the next section we will derive
an FWER bound

$$\mathrm{FWER}(\hat{A}(\mathbf{U})) \le \Phi(\theta_G, \theta_V, J)$$

under an appropriate compound null hypothesis.
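
To make the two-level rule concrete, here is a minimal Python sketch of the estimator above. The function name `coarse_to_fine_discoveries`, the dictionary layout and the toy scores are illustrative assumptions, not part of the paper; scores are treated as p-value-like quantities for which small values indicate a departure from the null.

```python
def coarse_to_fine_discoveries(cell_of, t_cell, t_index, theta_g, theta_v):
    """Return the indices v with T_{g(v)} <= theta_g and T_v <= theta_v.

    cell_of: dict mapping each index v to its cell g(v)  (hypothetical layout)
    t_cell:  dict mapping each cell g to its score T_g
    t_index: dict mapping each index v to its score T_v
    """
    return {
        v
        for v, g in cell_of.items()
        if t_cell[g] <= theta_g and t_index[v] <= theta_v
    }


# Toy data: two cells, four indices (made-up numbers).
cell_of = {"v1": "g1", "v2": "g1", "v3": "g2", "v4": "g2"}
t_cell = {"g1": 0.001, "g2": 0.40}
t_index = {"v1": 0.002, "v2": 0.30, "v3": 0.002, "v4": 0.50}

found = coarse_to_fine_discoveries(cell_of, t_cell, t_index,
                                   theta_g=0.01, theta_v=0.005)
print(sorted(found))  # ['v1']
```

In this toy example `v3` has a small index-level score but is not reported, because its cell does not pass the coarse step.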

The main results of the paper are in the ensuing analysis for different
models for $U$. In each case, the first objective is to compute $\Phi$ for a given $\theta_G$
and $\theta_V$ and the second objective is to maximize the power over all pairs
$(\theta_G, \theta_V)$ which satisfy $\Phi \le \alpha$. The smaller our upper bound on $J$, the
stronger is the clustering of active hypotheses in cells and the greater is
the gain in power compared with the Bonferroni bound. In particular, as
soon as $J \ll |G|$, the coarse-to-fine strategy will lead to a considerably less
conservative score threshold for individual hypotheses relative to the
Bonferroni estimate and the coarse-to-fine procedure will yield an increase in
power for a given FWER. Again, our assumptions about clustering are only
expressed through an upper bound on $J$; no other assumptions about the
distribution of $A$ are made and the FWER is controlled in all cases.

The main technical difficulty arises from the correlation between the
corresponding test statistics. This must be taken into account since it increases
the likelihood of an individual index $v$ being falsely declared active when
the cell $g(v)$ that contains it is falsely discovered (survivorship bias). More
specifically, we require sharp estimates of quadrant probabilities under the
joint distribution of $T_{g(v)}(\mathbf{U})$ and $T_v(\mathbf{U})$ when $g(v)$, the cell containing $v$,
is inactive. All these issues will be analyzed in two cases. First, we will
consider the standard linear model with Gaussian data. In this case $\Phi$ is
expressed in terms of centered chi-square distributions and the power is
expressed in terms of non-centered chi-square distributions. The efficiency of
the coarse-to-fine method in detecting active hypotheses will depend on
effect sizes, both at the level of cells and individual $v$, among other factors. A
non-parametric procedure will then be developed in section 4 based on
generalized permutation testing and invariance assumptions. Finally, we shall
derive a high-confidence upper bound on $J$ based on a martingale argument.
Extensive simulations comparing the power of the coarse-to-fine and
Bonferroni-Holm procedures appear throughout.

Applications and Related Work: As indicated above, our work (and
some of our notation) is inspired by statistical issues arising in GWAS [7,
8, 2] and related areas in computational genomics. In the most common
version of GWAS, the “genotype” of an individual is represented by the
genetic states at a very large family of genomic locations $v \in V$; these
variations are called single nucleotide polymorphisms or SNPs. In any given
study the objective is to find those SNPs $A \subset V$ “associated” with a given
“phenotype”, for example a measurable trait $Y$ such as height or blood
pressure. The null hypothesis for SNP $v$ is that $Y$ and $X_v$ are independent
r.v.s, and whereas $|V|$ may run into the millions, the set $A$ of active variants
is expected to be fewer than one hundred. (Ideally, one seeks the “causal”
variants, an even smaller set, but separating correlation and causality is
notoriously difficult.) Control of the FWER is the gold standard and the
linear model is common. If the considered variants are confined to coding
regions, then the set of genes provides a natural partition of $V$ (and the
fact that genes are organized into pathways provides a natural three-level
hierarchy) [14].

Another application of large-scale multiple testing is variable filtering in
high-dimensional prediction: the objective is to predict a categorical or
continuous variable $Y$ based on a family of potentially discriminating features
$X_v, v \in V$. Learning a predictor $\hat{Y}$ from i.i.d. samples of $U = (Y, X_v, v \in V)$
is often facilitated by limiting a priori the set of features utilized in training
$\hat{Y}$ to a subset $\hat{A} \subset V$ determined by testing the features one-by-one for
dependence on $Y$ and setting a significance threshold. In most applications
of machine learning to artificial perception, no premium is placed on pruning
$\hat{A}$ to a highly distinguished subset; indeed, the particular set of selected
features is rarely examined or considered of significance. In contrast, the
identities of the particular features selected and appearing in decision rules
are often of keen interest in computational genomics, e.g., discovering cancer
biomarkers, where the variables $X_v$ represent “omics” data (e.g., gene
expression), and $Y$ codes for two possible cellular or disease phenotypes.
Obtaining a “signature” $\hat{A}$ devoid of false positives can be beneficial in
understanding the underlying biology and interpreting the decision rules. In
this case the Gene Ontology (GO) [1] provides a very rich hierarchical structure,
but one example being the organization of genes in pathways. Indeed,
building predictors to separate “driver mutations” from “passenger mutations”
in cancer would appear to be a promising candidate for coarse-to-fine
testing due to the fact that drivers are known to cluster in pathways.

There is a literature on coarse-to-fine pattern recognition (see, e.g., [6] and
the references therein), but the emphasis has traditionally been on
computational efficiency rather than error control. Computation is not considered
here. Moreover, in most of this work, especially applications to vision and
speech, the emphasis is on detecting true positives (e.g., patterns of interest
such as faces) at the expense of false positives. Simply “reversing” the role
of true positives and negatives is not feasible due to the loss of reasonable
invariance assumptions; in effect, every pattern of interest is unique.

Finally, in [16], a hierarchical testing approach is used in the context of the
FWER. However, the intention is to improve the power of detection relative
to the Bonferroni-Holm methods only at the level of clusters of hypotheses; in
contrast to our method, the two approaches have comparable power at the
level of individual hypotheses.

Organization of the Paper: The paper is structured as follows: In section
2 we present a Bonferroni-based inequality that will be central for controlling
the FWER using the coarse-to-fine method in different models. In section
3 we will consider a parametric model that will illustrate precisely the way we
control the FWER at a fixed level and permit a power comparison between
coarse-to-fine and Bonferroni-Holm. We then propose a non-parametric
procedure in section 4 under general invariance assumptions. A method for
estimating an upper bound on the number of active cells and incorporating
it into the testing procedure without violating the FWER constraint
is derived in section 5. Finally, some concluding remarks are made in the
Discussion.

2. Coarse-to-fine framework. The finite family of null hypotheses
will be denoted by $(H_0(v), v \in V)$, where $H_0(v)$ is either true or false. We are
interested in the active set of indices, $A = \{v \in V : H_0(v) = \text{false}\}$, and will
write $V_0 = A^c$ for the set of inactive indices. Suppose our data $\mathbf{U}$ takes values
in $\mathcal{U}$. The set $\hat{A}(\mathbf{U})$ is commonly designed based on individual rejection
regions $\Gamma_v \subset \mathcal{U}$, with $\hat{A}(\mathbf{U}) = \{v : \mathbf{U} \in \Gamma_v\}$. As indicated in the previous
section, in the conservative Bonferroni approach, the FWER is controlled at
level $\alpha$ by requiring $|V| \max_{v \in V_0} \mathbb{P}(\mathbf{U} \in \Gamma_v) \le \alpha$. If the rejection regions are
designed so that this probability is independent of $v$ whenever $H_0(v) = \text{true}$,
then the condition boils down to $\mathbb{P}(\mathbf{U} \in \Gamma_v) \le \alpha/|V|$ for $v \in V_0$. Generally,
$\Gamma_v = \{\mathbf{u} \in \mathcal{U} : T_v(\mathbf{u}) \le \theta\}$ for a constant $\theta$ and some family of test statistics
$(T_v, v \in V)$.

While there is not much to do in the general case to improve on the
Bonferroni method, it is possible to improve power if $V$ is structured and one
has prior knowledge about the way the active hypotheses are organized relative
to this structure. In this paper, we consider a coarse-to-fine framework in
which $V$ is provided with a partition $G$, so that $V = \bigcup_{g \in G} g$, where the
subsets $g \subset V$ (which we will call cells) are non-overlapping. For $v \in V$, we
let $g(v)$ denote the unique cell $g$ that contains it. The “coarse” step selects
cells likely to contain active indices, followed by a “fine” step in which a
Bonferroni or equivalent procedure is applied only to hypotheses included
in the selected cells. More explicitly, we will associate a rejection region $\Gamma_g$
to each $g \in G$ and consider the discovery set

$$\hat{A}(\mathbf{U}) = \{v \in V : \mathbf{U} \in \Gamma_{g(v)} \cap \Gamma_v\}. \tag{1}$$

We will say that a cell $g$ is active if and only if $g \cap A \neq \emptyset$, which we
shall also express as $H_0(g) = \text{false}$, implicitly defining $H_0(g)$ as the logical
“and” of all $H_0(v), v \in g$. We will also consider the double null hypothesis
$H_{00}(v) = H_0(g(v))$ of $v$ belonging to an inactive cell (which obviously implies
that $v$ is inactive too), and we will let $V_{00} \subset V_0$ be the set of such $v$'s.
Let $\nu_G$ denote the size of the largest cell in $G$ and $J$ be the number
of active cells. We will develop our procedure under the assumption that
$J$ is known, or at least bounded from above. While this can actually be a
plausible assumption in practice, we will relax it in section 5, in which we will
design a procedure to estimate a bound on $J$. Then under these assumptions
we have the following result:

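Proposition 2.1. With $\hat{A}$ defined by (1):

$$\mathrm{FWER}(\hat{A}) \le |V| \max_{v \in V_{00}} \mathbb{P}\big(\mathbf{U} \in \Gamma_{g(v)} \cap \Gamma_v\big) + \nu_G J \max_{v \in V_0} \mathbb{P}\big(\mathbf{U} \in \Gamma_v\big).$$

Proof. This is just the Bonferroni bound applied to the decomposition

$$\begin{aligned}
\{\hat{A}(\mathbf{U}) \cap V_0 \neq \emptyset\}
&= \bigcup_{v \in V_{00}} \{\mathbf{U} \in \Gamma_{g(v)} \cap \Gamma_v\} \cup \bigcup_{v \in V_0 \setminus V_{00}} \{\mathbf{U} \in \Gamma_{g(v)} \cap \Gamma_v\} \\
&\subset \bigcup_{v \in V_{00}} \{\mathbf{U} \in \Gamma_{g(v)} \cap \Gamma_v\} \cup \bigcup_{v \in V_0 \setminus V_{00}} \{\mathbf{U} \in \Gamma_v\}
\end{aligned}$$

so that

$$\mathbb{P}\big(\hat{A}(\mathbf{U}) \cap V_0 \neq \emptyset\big) \le |V_{00}| \max_{v \in V_{00}} \mathbb{P}\big(\mathbf{U} \in \Gamma_{g(v)} \cap \Gamma_v\big) + |V_0 \setminus V_{00}| \max_{v \in V_0} \mathbb{P}\big(\mathbf{U} \in \Gamma_v\big)$$

and the proposition results from $|V_{00}| \le |V|$ and $|V_0 \setminus V_{00}| \le \nu_G J$ (the inactive
indices outside $V_{00}$ all lie in one of the at most $J$ active cells, each of size at most $\nu_G$). □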

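The sets $\Gamma_g$ and $\Gamma_v$ will be designed using statistics $T_g(\mathbf{U})$ and $T_v(\mathbf{U})$,
setting $\Gamma_g = [T_g(\mathbf{U}) \le \theta_G]$ and $\Gamma_v = [T_v(\mathbf{U}) \le \theta_V]$ for some constants
$\theta_G$ and $\theta_V$, and assuming that the distribution of $(T_{g(v)}(\mathbf{U}), T_v(\mathbf{U}))$ (resp.
$T_v(\mathbf{U})$) is independent of $v$ for $v \in V_{00}$ (resp. $v \in V_0$). Letting
$p_{00}(\theta_G, \theta_V) = \mathbb{P}\big(\{T_{g(v)}(\mathbf{U}) \le \theta_G\} \cap \{T_v(\mathbf{U}) \le \theta_V\}\big)$ for $v \in V_{00}$ and
$p_0(\theta_V) = \mathbb{P}\big(T_v(\mathbf{U}) \le \theta_V\big)$ for $v \in V_0$, the previous upper bound becomes

$$\mathrm{FWER}(\hat{A}) \le |V|\, p_{00}(\theta_G, \theta_V) + \nu_G J\, p_0(\theta_V). \tag{2}$$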

In the following sections our goal will be to design $\theta_G$ and $\theta_V$ such that this
upper bound is smaller than a predetermined level $\alpha$. Controlling the second
term will lead to less conservative choices of the constant $\theta_V$ (compared to
the Bonferroni estimate), as soon as $\nu_G J \ll |V|$ (or $J \ll |G|$ if all cells have
comparable sizes). Depending on the degree of clustering, the probability $p_{00}$
of false detection in the two-step procedure can be made much smaller than
$p_0$ without harming the true detection rate and the coarse-to-fine procedure
will yield an increase in power for a given FWER. We require tight estimates
of $p_{00}$, and taking into account the correlation between $T_{g(v)}(\mathbf{U})$ and $T_v(\mathbf{U})$
is necessary to deal with “survivorship bias.”
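
The role of $\nu_G J$ in bound (2) can be illustrated with a few lines of Python. This is only a sketch: the helper name `max_index_level` and all numerical values are assumptions chosen for illustration, and the joint probability $p_{00}$ is simply taken as given (here, small enough that the first term of (2) uses half of $\alpha$).

```python
def max_index_level(alpha, n_hyp, nu_g, n_active_cells, p00):
    """Largest p0 allowed by (2): (alpha - |V| * p00) / (nu_G * J)."""
    slack = alpha - n_hyp * p00
    if slack <= 0:
        raise ValueError("|V| * p00 already exceeds alpha")
    return slack / (nu_g * n_active_cells)


# Illustrative numbers (assumptions, not the paper's experiments).
alpha, n_hyp, nu_g = 0.1, 10_000, 10
p00 = 0.5 * alpha / n_hyp        # assume the first term of (2) uses alpha / 2
bonferroni = alpha / n_hyp       # per-test level of plain Bonferroni

for J in (5, 50, 500):
    gain = max_index_level(alpha, n_hyp, nu_g, J, p00) / bonferroni
    print(J, gain)               # ratio of admissible index-level probabilities
```

With these numbers the admissible index-level probability is roughly 100, 10 and 1 times the Bonferroni level for $J = 5, 50, 500$, so the advantage disappears once $\nu_G J$ becomes comparable to $|V|$.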
3. Model-based derivation.


3.1. Regression model. In this section, the observation is a realization of
an i.i.d. family of random variables $\mathbf{U} = ((Y^k, X^k), k = 1, \ldots, n)$ where the
$Y$'s are real-valued and the variables $X^k = (X^k_v, v \in V)$ form a high-dimensional
family of variables indexed by the set $V$. We assume that the
$X^k_v, v \in V$, are independent and centered Gaussian, with variance $\sigma_v^2$, and
that

$$Y^k = a_0 + \sum_{v \in A} a_v X^k_v + \xi^k,$$

where $\xi^1, \ldots, \xi^n$ are i.i.d. Gaussian with variance $\sigma^2$ and $a_v, v \in A$, are
unknown real coefficients. We will denote by $Y$ the vector $(Y^1, \ldots, Y^n)$ and
by $\bar{Y} = \big(\sum_{k=1}^n Y^k / n\big) \mathbf{1}_n$, where $\mathbf{1}_n$ is the vector composed of ones repeated
$n$ times. We also let $X_v = (X^1_v, \ldots, X^n_v)$ and $\xi = (\xi^1, \ldots, \xi^n)$ so that

$$Y = a_0 \mathbf{1}_n + \sum_{v \in A} a_v X_v + \xi.$$

Finally, we will denote by $\sigma_Y^2$ the common variance of $Y^1, \ldots, Y^n$ and
assume that it is known (or estimated from the observed data).


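3.2. Scores. For $v \in V$, we denote by $P_v$ the orthogonal projection on
the subspace $S_v$ spanned by the two vectors $X_v$ and $\mathbf{1}_n$. We will also denote
by $P_g$ ($g \in G$) the orthogonal projection on the subspace $S_g$ spanned by
the vectors $X_v$, $v \in g$, and $\mathbf{1}_n$. The scores at the $g$ level and $v$ level will be
respectively:

$$T_g(\mathbf{U}) = \frac{\|P_g Y\|^2 - \|\bar{Y}\|^2}{\sigma_Y^2} \quad \text{and} \quad T_v(\mathbf{U}) = \frac{\|P_v Y\|^2 - \|\bar{Y}\|^2}{\sigma_Y^2}.$$

(The projections are simply obtained by least-square regression of $Y$ on
$X_v$, $v \in g$, for $P_g$ and on $X_v$ for $P_v$.) We now provide estimates of

$$p_{00}(\theta_G, \theta_V) = \mathbb{P}\left(\frac{\|P_g Y\|^2 - \|\bar{Y}\|^2}{\sigma_Y^2} > \theta_G;\ \frac{\|P_v Y\|^2 - \|\bar{Y}\|^2}{\sigma_Y^2} > \theta_V\right)$$

for $v \in V_{00}$ and $g = g(v)$ and

$$p_0(\theta_V) = \mathbb{P}\left(\frac{\|P_v Y\|^2 - \|\bar{Y}\|^2}{\sigma_Y^2} > \theta_V\right)$$

for $v \in V_0$. Note that, because we consider residual sums of squares, we here
use large values of the scores in the rejection regions (instead of small values
in the introduction and other parts of the paper), hopefully without risk of
confusion.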
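
For a single covariate, the index-level score reduces to the usual regression sum of squares: $\|P_v Y\|^2 - \|\bar{Y}\|^2 = S_{xy}^2 / S_{xx}$ with $S_{xy} = \sum_k (X_v^k - \bar{x})(Y^k - \bar{y})$ and $S_{xx} = \sum_k (X_v^k - \bar{x})^2$. The sketch below uses this identity to compute $T_v$ on simulated data; the function name `index_score`, the sample size and the toy coefficients are assumptions made for illustration only.

```python
import random


def index_score(x, y, sigma_y2):
    """(||P_v Y||^2 - ||Ybar||^2) / sigma_Y^2 for one regressor plus intercept."""
    n = len(y)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    return (s_xy ** 2 / s_xx) / sigma_y2


# Toy model (assumption): Y = 1 + X_active + noise, unit variances everywhere.
rng = random.Random(0)
n = 500
x_active = [rng.gauss(0.0, 1.0) for _ in range(n)]
x_null = [rng.gauss(0.0, 1.0) for _ in range(n)]
y = [1.0 + xa + rng.gauss(0.0, 1.0) for xa in x_active]
sigma_y2 = 1.0 + 1.0   # variance of Y under this toy model

print(index_score(x_active, y, sigma_y2))  # large: the index is active
print(index_score(x_null, y, sigma_y2))    # behaves like a chi-square(1) draw
```

The cell-level score $T_g$ would require a multiple regression on all $X_v$, $v \in g$, and is left out of this sketch.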


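Proposition 3.1. For all $\theta_G$ and $\theta_V$:

$$p_{00}(\theta_G, \theta_V) \le C(\nu_G) \exp\left(-\frac{\theta_G}{2}\right) \theta_G^{\nu_G/2} \left(1 - G_\beta\left(\frac{\theta_V}{\theta_G}, \frac{1}{2}, \frac{\nu_G + 1}{2}\right)\right) + \big(1 - F_1(\theta_G - \nu_G + 1)\big),$$

where $G_\beta(x, a, b)$ is the CDF of a $\beta(a, b)$ distribution evaluated at $x$ and:

$$C(\nu_G) = \frac{\exp\left(\frac{\nu_G - 1}{2}\right) \Gamma\left(\frac{\nu_G + 1}{2}\right)}{\sqrt{2}\,(\nu_G - 1)^{(\nu_G - 1)/2}\, \Gamma\left(\frac{\nu_G}{2} + 1\right)}.$$

Moreover

$$p_0(\theta_V) \le 1 - F_1(\theta_V),$$

where $F_k$ is the c.d.f. of a chi-squared distribution with $k$ degrees of freedom.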
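Proof. For $v \in V_{00}$ and $g = g(v)$, we can write

$$\mathbb{P}\left(\frac{\|P_g Y\|^2 - \|\bar{Y}\|^2}{\sigma_Y^2} > \theta_G;\ \frac{\|P_v Y\|^2 - \|\bar{Y}\|^2}{\sigma_Y^2} > \theta_V\right) = \mathbb{P}\left(\frac{\|P_g Y\|^2 - \|\bar{Y}\|^2}{\sigma_{Y-g}^2} > \theta_G;\ \frac{\|P_v Y\|^2 - \|\bar{Y}\|^2}{\sigma_{Y-g}^2} > \theta_V\right),$$

where $\sigma_{Y-g}^2$ denotes the conditional variance of $Y^k$ given $(X_v)_{v \in g}$, because
$\sigma_{Y-g}^2 = \sigma_Y^2$ when $A \cap g^c = A$.

Consider the conditional probability:

$$\mathbb{P}\left(\frac{\|P_g Y\|^2 - \|\bar{Y}\|^2}{\sigma_{Y-g}^2} > \theta_G;\ \frac{\|P_v Y\|^2 - \|\bar{Y}\|^2}{\sigma_{Y-g}^2} > \theta_V \,\middle|\, (X_v)_{v \in g}\right).$$

The conditional distribution of $Y$ given $(X_v)_{v \in g}$ is Gaussian $\mathcal{N}(0, \sigma_{Y-g}^2 I_n)$
(where $I_n$ is the $n$-dimensional identity matrix). Denote by $P'_v$ the
projection on the orthogonal complement of $\mathbf{1}_n$ in $S_v$ and by $P'_g$ the projection
on the orthogonal complement of $S_v$ in $S_g$, so that

$$\|P_g Y\|^2 - \|\bar{Y}\|^2 = \|P'_g Y\|^2 + \|P'_v Y\|^2$$

and

$$\|P_v Y\|^2 - \|\bar{Y}\|^2 = \|P'_v Y\|^2.$$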

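This implies that:

$$\begin{aligned}
&\mathbb{P}\left(\frac{\|P_g Y\|^2 - \|\bar{Y}\|^2}{\sigma_{Y-g}^2} > \theta_G;\ \frac{\|P_v Y\|^2 - \|\bar{Y}\|^2}{\sigma_{Y-g}^2} > \theta_V \,\middle|\, (X_v)_{v \in g}\right) \\
&\qquad = \mathbb{P}\left(\frac{\|P'_g Y\|^2 + \|P'_v Y\|^2}{\sigma_{Y-g}^2} > \theta_G;\ \frac{\|P'_v Y\|^2}{\sigma_{Y-g}^2} > \theta_V \,\middle|\, (X_v)_{v \in g}\right).
\end{aligned}$$

At this stage, applying Cochran's theorem to $P'_g(Y/\sigma_{Y-g})$ and $P'_v(Y/\sigma_{Y-g})$,
which are conditionally independent given $X_v$, $v \in g$, reduces the problem
to finding an upper bound for:

$$\mathbb{P}(\eta + \zeta > \theta_G;\ \zeta > \theta_V),$$

where $\eta$ is $\chi^2(\nu_G - 1)$ and $\zeta$ is $\chi^2(1)$, and the two variables are independent.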
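Let us write this probability as

$$\mathbb{P}(\eta + \zeta > \theta_G;\ \zeta > \theta_V;\ \zeta \le \theta_G - \nu_G + 1) + \mathbb{P}(\eta + \zeta > \theta_G;\ \zeta > \theta_V;\ \zeta > \theta_G - \nu_G + 1),$$

which is less than:

$$\mathbb{E}\big(\mathbf{1}_{\eta + \zeta > \theta_G} \mathbf{1}_{\zeta > \theta_V} \mathbf{1}_{\zeta \le \theta_G - \nu_G + 1}\big) + \big(1 - F_1(\theta_G - \nu_G + 1)\big).$$

(Here, $\mathbb{E}$ refers to the expectation with respect to $\mathbb{P}$.)

Consider the first term in the sum: $\mathbb{E}\big(\mathbf{1}_{\eta + \zeta > \theta_G} \mathbf{1}_{\zeta > \theta_V} \mathbf{1}_{\zeta \le \theta_G - \nu_G + 1}\big)$. This
term can be re-written as:

$$\mathbb{E}\big(\mathbb{E}(\mathbf{1}_{\eta > \theta_G - \zeta} \mid \zeta)\, \mathbf{1}_{\zeta > \theta_V} \mathbf{1}_{\zeta \le \theta_G - \nu_G + 1}\big).$$

At this stage, we will use the following tail inequality for $\chi^2(k)$ random
variables:

$$1 - F_k(zk) \le \big(z \exp(1 - z)\big)^{k/2},$$

for any $z > 1$. We apply this result to $k = \nu_G - 1$ and $z = (\theta_G - \zeta)/(\nu_G - 1)$ to get the
upper bound:

$$\mathbb{E}\big(\mathbb{E}(\mathbf{1}_{\eta > \theta_G - \zeta} \mid \zeta)\, \mathbf{1}_{\zeta > \theta_V} \mathbf{1}_{\zeta \le \theta_G - \nu_G + 1}\big) \le \mathbb{E}\left(\left(\frac{\theta_G - \zeta}{\nu_G - 1}\right)^{\frac{\nu_G - 1}{2}} \exp\left(\frac{\nu_G - 1}{2} - \frac{\theta_G - \zeta}{2}\right) \mathbf{1}_{\zeta > \theta_V} \mathbf{1}_{\zeta \le \theta_G - \nu_G + 1}\right).$$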
Since the density of a $\chi^2(1)$ is proportional to $\zeta^{-1/2}\exp(-\zeta/2)$, the term in
$\exp(\zeta/2)$ will cancel in the last integral (expectation). Using a simple change of
variables in the remaining integral, we have as a final upper bound:

$$C(\nu_G) \exp\left(-\frac{\theta_G}{2}\right) \theta_G^{\nu_G/2} \left(1 - G_\beta\left(\frac{\theta_V}{\theta_G}, \frac{1}{2}, \frac{\nu_G + 1}{2}\right)\right),$$

where $G_\beta(x, a, b)$ is the CDF of a Beta$(a, b)$ evaluated at $x$.

The second upper-bound, for $p_0(\theta_V)$, is easily obtained, the proof being left
to the reader. □

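This leads us immediately to the following corollary:

Corollary 3.1. With the thresholds $\theta_G$ and $\theta_V$, an upper bound of the
FWER is:

$$\begin{aligned}
\mathrm{FWER}(\hat{A}) \le{}& |V|\, C(\nu_G) \exp\left(-\frac{\theta_G}{2}\right) \theta_G^{\nu_G/2} \left(1 - G_\beta\left(\frac{\theta_V}{\theta_G}, \frac{1}{2}, \frac{\nu_G + 1}{2}\right)\right) \\
&+ |V| \big(1 - F_1(\theta_G - \nu_G + 1)\big) + J \nu_G \big(1 - F_1(\theta_V)\big).
\end{aligned} \tag{3}$$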
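
As a rough numerical illustration of this bound, the following standard-library Python sketch evaluates $|V|\,p_{00} + J\nu_G\,p_0$ with the estimates of Proposition 3.1. Several points are assumptions rather than the paper's own implementation: the function names, the midpoint-rule evaluation of the Beta tail, the example parameter values, and the constant $C(\nu) = e^{(\nu-1)/2}\,\Gamma(\tfrac{\nu+1}{2}) / \big(\sqrt{2}\,(\nu-1)^{(\nu-1)/2}\,\Gamma(\tfrac{\nu}{2}+1)\big)$, which is what one obtains by carrying out the integral in the proof.

```python
import math


def chi2_1_sf(x):
    """1 - F_1(x): upper tail of a chi-square with one degree of freedom."""
    return 1.0 if x <= 0 else math.erfc(math.sqrt(x / 2.0))


def beta_sf(x, a, b, steps=20_000):
    """1 - G_beta(x; a, b), midpoint rule on [x, 1]; intended for 0 < x <= 1."""
    if x >= 1.0:
        return 0.0
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    h = (1.0 - x) / steps
    total = 0.0
    for i in range(steps):
        t = x + (i + 0.5) * h
        total += math.exp(log_norm + (a - 1) * math.log(t)
                          + (b - 1) * math.log1p(-t))
    return total * h


def log_c(nu):
    """log C(nu), with C taken from the integral in the proof (assumption)."""
    return ((nu - 1) / 2.0 - 0.5 * math.log(2.0)
            - (nu - 1) / 2.0 * math.log(nu - 1)
            + math.lgamma((nu + 1) / 2.0) - math.lgamma(nu / 2.0 + 1.0))


def fwer_bound_gaussian(n_hyp, nu, n_active_cells, theta_g, theta_v):
    """|V| * (bound on p00) + J * nu * (bound on p0); needs nu >= 2, theta_g > nu - 1."""
    first = math.exp(log_c(nu) - theta_g / 2.0 + nu / 2.0 * math.log(theta_g))
    first *= beta_sf(theta_v / theta_g, 0.5, (nu + 1) / 2.0)
    p00 = first + chi2_1_sf(theta_g - nu + 1)
    return n_hyp * p00 + n_active_cells * nu * chi2_1_sf(theta_v)


# Illustrative values only: |V| = 10000, cells of size 10, J = 20.
print(fwer_bound_gaussian(10_000, 10, 20, theta_g=60.0, theta_v=16.0))
```

Scanning `theta_g` and `theta_v` over a grid with this function traces level curves of the bound, which is the kind of picture described for Figure 1.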

Figure 1 provides an illustration of the level curves associated to the above
FWER upper bound. More precisely, it illustrates the tradeoff between the
conservativeness at the cell level and the individual index level. In the next
section, the optimization for power will be made along these level lines.
Figure 1 also provides the value of the Bonferroni-Holm threshold. For the
coarse-to-fine procedure to be less conservative than the Bonferroni-Holm
approach, we need the index-level threshold to be smaller, i.e., the optimal
point on the level line to be chosen below the corresponding dashed line.
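The derivation of (3) is based on the assumption that we have a fixed cell
size (across all the cells), which is not needed. In the case where the size of
the cell is varying, it is easy to generalize the previous upper bound. Letting

$$\varphi(\nu_G, \theta_G, \theta_V) = C(\nu_G) \exp\left(-\frac{\theta_G}{2}\right) \theta_G^{\nu_G/2} \left(1 - G_\beta\left(\frac{\theta_V}{\theta_G}, \frac{1}{2}, \frac{\nu_G + 1}{2}\right)\right) + \big(1 - F_1(\theta_G - \nu_G + 1)\big),$$

it suffices to replace $|V| \varphi(\nu_G, \theta_G, \theta_V)$ in (3) with $\sum_{g \in G} |g|\, \varphi(|g|, \theta_G, \theta_V)$,
where $\theta_G$ does not depend on the cell $g$.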

3.3. Optimal thresholds. Equation (3) provides a constraint on the pair
$(\theta_G, \theta_V)$ to control the FWER at a given level $\alpha$. We now show how to
obtain “optimal” thresholds $(\theta_G, \theta_V)$ that maximize discovery subject to this
constraint. The discussion will also help in understanding how the clustering of
active indices in cells improves the power of the coarse-to-fine procedure.
Fig 1. Level curves of the upper bound of the FWER for the levels 0.2 (blue), 0.1 (green)
and 0.05 (red). The horizontal dashed lines represent the thresholds at the individual level 
for a Bonferroni-Holm test, with corresponding colors. 


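The conditional distribution of $Y$ given $(X_v, v \in g)$ is
$\mathcal{N}\big(a_0 \mathbf{1}_n + \sum_{v \in g \cap A} a_v X_v,\ \sigma_{Y-g}^2 I_n\big)$
with $\sigma_{Y-g}^2 = \sigma^2 + \sum_{v \in A \cap g^c} a_v^2 \sigma_v^2$. It follows from this that, conditionally to
these variables, $(\|P_g Y\|^2 - \|\bar{Y}\|^2)/\sigma_{Y-g}^2$ follows a non-central chi-square
distribution $\chi^2\big(\rho_g(X_v, v \in g), |g|\big)$, with

$$\rho_g(X_v, v \in g) = \frac{\big\|\sum_{v \in g \cap A} a_v (X_v - \bar{X}_v)\big\|^2}{\sigma_{Y-g}^2},$$

where $\bar{X}_v = \big(\frac{1}{n}\sum_{k=1}^n X_v^k\big) \mathbf{1}_n$. Using the fact that $\rho_g(X_v, v \in g)/n$ converges
to

$$\rho_g := \frac{\sum_{v \in g \cap A} a_v^2 \sigma_v^2}{\sigma_{Y-g}^2},$$

we will work with the approximation

$$\frac{\|P_g Y\|^2 - \|\bar{Y}\|^2}{\sigma_{Y-g}^2} \sim \chi^2(n\rho_g, |g|).$$

With a similar analysis, and letting for $v \in A$,
$\sigma_{Y-v}^2 = \sigma^2 + \sum_{v' \in A \setminus \{v\}} a_{v'}^2 \sigma_{v'}^2$,
we will assume that

$$\frac{\|P_v Y\|^2 - \|\bar{Y}\|^2}{\sigma_{Y-v}^2} \sim \chi^2(n\rho_v, 1)$$

with

$$\rho_v := \frac{a_v^2 \sigma_v^2}{\sigma_{Y-v}^2}.$$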
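We now have the following simple result.

Proposition 3.2. If $Z_1 \sim \chi^2(\rho_1, \nu_1)$ and $Z_2 \sim \chi^2(\rho_2, \nu_2)$, then for all
$\theta_1, \theta_2$ such that $\theta_i \le \rho_i + \nu_i$, $i = 1, 2$,

$$\mathbb{P}(Z_1 > \theta_1;\ Z_2 > \theta_2) \ge 1 - \sum_{i=1}^{2} \exp\left(-\frac{(\nu_i + \rho_i - \theta_i)^2}{4(\nu_i + 2\rho_i)}\right). \tag{4}$$

Proof. This is based on the inequality [15], valid for $Z \sim \chi^2(\rho, \nu)$:

$$\mathbb{P}\left(Z \le \rho + \nu - 2\sqrt{(\nu + 2\rho)x}\right) \le \exp(-x),$$

which implies

$$\mathbb{P}(Z \le \theta) \le \exp\left(-\frac{(\nu + \rho - \theta)^2}{4(\nu + 2\rho)}\right)$$

as soon as $\theta \le \rho + \nu$, and on the simple lower-bound

$$\mathbb{P}(Z_1 > \theta_1;\ Z_2 > \theta_2) \ge 1 - \sum_{i=1}^{2} \mathbb{P}(Z_i \le \theta_i).$$

□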
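
The detection bound can be evaluated directly from the exponential tail estimate used in the proof, $\mathbb{P}(Z \le \theta) \le \exp\big(-(\nu + \rho - \theta)^2 / (4(\nu + 2\rho))\big)$ for $\theta \le \rho + \nu$. The sketch below does so in plain Python; the function names and the numerical values (sample size, effect size, thresholds) are illustrative assumptions.

```python
import math


def ncx2_lower_tail_bound(theta, rho, nu):
    """Bound on P(Z <= theta) for Z ~ chi2(rho, nu); needs theta <= rho + nu."""
    if theta > rho + nu:
        raise ValueError("the bound requires theta <= rho + nu")
    return math.exp(-((nu + rho - theta) ** 2) / (4.0 * (nu + 2.0 * rho)))


def detection_lower_bound(theta_1, rho_1, nu_1, theta_2, rho_2, nu_2):
    """1 minus the sum of the two lower-tail bounds (joint exceedance bound)."""
    return 1.0 - (ncx2_lower_tail_bound(theta_1, rho_1, nu_1)
                  + ncx2_lower_tail_bound(theta_2, rho_2, nu_2))


# Illustrative target case (assumed values): n = 1000 samples, effect size
# rho_v = 0.05, k = 2 active indices per active cell, cells of size 10.
n, rho_v, k, cell_size = 1000, 0.05, 2, 10
print(detection_lower_bound(60.0, n * k * rho_v, cell_size,   # cell level
                            16.0, n * rho_v, 1))              # index level
```

With these assumed values the bound is about 0.90, so an active index in a cell containing two active indices would be detected with probability at least roughly 90% at thresholds $(\theta_G, \theta_V) = (60, 16)$.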

This proposition can be applied, in our case, to $(\rho_1, \nu_1) = (n\rho_g, |g|)$ and
$(\rho_2, \nu_2) = (n\rho_v, 1)$. More concretely, we fix a target effect size $\rho$ (the ratio of
the effect of $X_v$ compared to the total variance of $Y$), and a target cluster
size, $k$, that represents the number of active loci that we expect to find in
an active cell, and we take $\rho_v = \rho$ and $\rho_g = k\rho$ to optimize the bound
in (4) subject to the FWER constraint (3) and $\theta_G \le n\rho_g + |g|$ and
$\theta_V \le n\rho_v + 1$ to find optimal constants $(\theta_G, \theta_V)$ for this target case. This is
illustrated with numerical simulations in the next section.

3.4. Simulation results (parametric case). Figure 2 compares the power
of the coarse-to-fine procedure and of the Bonferroni-Holm procedure
under the parametric model described in this section.

The parameters chosen in our simulations were taken with our motivating
application (to GWAS) in mind. Thinking of $Y$ as a phenotype, and $V$ as a
set of SNPs, we assimilate cells $g \in G$ to genes. We used $|V| = 10000$ and
$|g| = 10$. The true number of active variables is 50 with a corresponding
coefficient $a_v = 1$ for each of them, and we generate the data according
to the linear model described in this section with a noise variance that is
equal to 10. We assumed that we knew an upper bound $J$ for the number
of active cells (this assumption is relaxed in section 5). To compute the
optimal thresholds, some values for $\rho_g$ and $\rho_v$ have to be chosen (this should
not be based on observed data, since this would invalidate our FWER and
power estimates). In our experiments, we optimize the bound on the
probability for an active variable to be detected in an active cell by choosing
$\rho_g = 2/J$ and $\rho_v = 1/J$. This corresponds to an “almost non-noisy case”
where the effect size of the “gene” is two times the effect size of the “SNP”.

Fig 2. Comparison of the power of different methods. We compare the detection rate of
an individual active index for different numbers of active indices in the cell containing that
index. The coarse-to-fine method is more powerful when the number of active indices is
two or greater. This confirms the intuition that the more the clustering assumption is true,
the more powerful is the coarse-to-fine method compared to the Bonferroni-Holm approach.
4. Non-parametric coarse-to-fine testing. 

4.1. Notation. Recall that $\mathbf{U}$ denotes the random variable representing
all the data, taking values in $\mathcal{U}$. We will build our procedure from user-defined
scores, denoted $\rho_v$ (at the locus level) and $\rho_g$ (at the cell level), both
defined on $\mathcal{U}$, i.e., functions of the observed data.

Moreover, we assume that there exists a group action of some group $\mathfrak{S}$
on $\mathcal{U}$, which will be denoted

$$(\xi, \mathbf{u}) \mapsto \xi \odot \mathbf{u}.$$

For example, if, like in the previous section, one takes

$$\mathbf{U} = ((Y^k, X^k), k = 1, \ldots, n)$$

and $X^k = (X^k_v)_{v \in V}$, we will take $\mathfrak{S}$ to be the permutation group of $\{1, \ldots, n\}$
with

$$\xi \odot \mathbf{U} = ((Y^{\xi_k}, X^k), k = 1, \ldots, n).$$

To simplify the discussion, we will assume that $\mathfrak{S}$ is finite and denote by $\mu$
the uniform probability measure on $\mathfrak{S}$, so that

$$\int_{\mathfrak{S}} f(\xi)\, d\mu(\xi) = \frac{1}{|\mathfrak{S}|} \sum_{\xi \in \mathfrak{S}} f(\xi).$$

We note, however, that our discussion remains true if $\mathfrak{S}$ is a compact group,
$\mu$ the right-invariant Haar probability measure on $\mathfrak{S}$ and $\rho_v(\xi \odot \mathbf{u})$, $\rho_g(\xi \odot \mathbf{u})$
are continuous in $\xi$.

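Our running assumption will be that:

1. For any $v \in V_{00}$, the joint distribution of
$\big(\rho_{g(v)}(\xi' \odot \xi \odot \mathbf{U}), \rho_v(\xi' \odot \xi \odot \mathbf{U})\big)_{\xi' \in \mathfrak{S}}$ is independent of $\xi \in \mathfrak{S}$.

2. For any $v \in V_0$, the joint distribution of $\big(\rho_v(\xi' \odot \xi \odot \mathbf{U})\big)_{\xi' \in \mathfrak{S}}$ is
independent of $\xi \in \mathfrak{S}$.

We will also use the following well-known result.

Lemma 4.1. Let $X$ be a random variable and let $F_X^-$ denote the left limit
of its cumulative distribution function, i.e., $F_X^-(t) = \mathbb{P}(X < t)$. Then, for
$t \in [0, 1]$, one has

$$\mathbb{P}\big(F_X^-(X) \ge 1 - t\big) \le t$$

(with equality if $F_X$ is continuous).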
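
The lemma, read as $\mathbb{P}(F_X^-(X) \ge 1 - t) \le t$ with $F_X^-(x) = \mathbb{P}(X < x)$, can be checked exhaustively on a small discrete distribution. The sketch below does this with exact rational arithmetic; the distribution and the helper names are arbitrary choices made for illustration.

```python
from fractions import Fraction


def left_limit_cdf(values, probs, x):
    """F^-(x) = P(X < x) for a discrete distribution."""
    return sum(p for v, p in zip(values, probs) if v < x)


def lemma_lhs(values, probs, t):
    """P(F^-(X) >= 1 - t)."""
    return sum(p for v, p in zip(values, probs)
               if left_limit_cdf(values, probs, v) >= 1 - t)


# Arbitrary discrete distribution on {0, 1, 2, 5}.
values = [0, 1, 2, 5]
probs = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 8), Fraction(1, 8)]

for t in [Fraction(k, 16) for k in range(17)]:
    assert lemma_lhs(values, probs, t) <= t   # the inequality of the lemma

print(lemma_lhs(values, probs, Fraction(1, 8)))   # 1/8: equality at an atom
print(lemma_lhs(values, probs, Fraction(3, 16)))  # 1/8: strict inequality
```

The inequality is strict for values of $t$ that fall between the attainable levels, which is the discreteness effect discussed later for permutation distributions.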

4.2. Asymptotic resampling scores. We define the asymptotic scores at
the cell and variable level by

$$T_g(\mathbf{u}) = \mu\big(\xi : \rho_g(\mathbf{u}) \le \rho_g(\xi \odot \mathbf{u})\big) \tag{5}$$

and

$$T_v(\mathbf{u}) = \mu\big(\xi : \rho_v(\mathbf{u}) \le \rho_v(\xi \odot \mathbf{u})\big). \tag{6}$$

$T_g(\mathbf{U})$ and $T_v(\mathbf{U})$ are the typical statistics used in randomized tests,
estimating the proportion of scores that are higher than the observed one after
randomization. For the coarse-to-fine procedure, we will need one more
“conditional” statistic.

For a given constant $\theta_G$, we define

$$N_g^{\theta_G}(\mathbf{u}) = \mu\big(\xi : T_g(\xi \odot \mathbf{u}) \le \theta_G\big). \tag{7}$$

We then let, with $g = g(v)$,

$$T_v^{\theta_G}(\mathbf{u}) = \frac{1}{N_g^{\theta_G}(\mathbf{u})}\, \mu\big(\xi : \rho_v(\mathbf{u}) \le \rho_v(\xi \odot \mathbf{u});\ T_g(\xi \odot \mathbf{u}) \le \theta_G\big). \tag{8}$$

We call our scores asymptotic in this section because exact expectations
over $\mu$ cannot be computed in general, and can only be obtained as limits
of Monte-Carlo samples. The practical finite-sample case will be handled in
the next section.
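
For very small $n$ the group of permutations can be enumerated, and the four quantities above can then be computed exactly. The sketch below does this for a toy data set. The score functions are assumptions made for illustration: the index-level score is an unnormalized absolute covariance, and the cell-level score is its maximum over the cell. The data are integers so that ties between permutations are detected exactly.

```python
from itertools import permutations


def abs_cov(x, y):
    """Toy score |n * sum(x*y) - sum(x) * sum(y)|; exact for integer data."""
    n = len(x)
    return abs(n * sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y))


def exact_scores(y, x_cell, v, theta_g):
    """Return (T_g, T_v, N_g, T_v^{theta_G}) for the observed data, by enumeration.

    x_cell: covariate vectors of the cell g(v); v: position of the index in it.
    The group acts by permuting the entries of y.
    """
    perms = list(permutations(range(len(y))))   # perms[0] is the identity
    rho_v, rho_g = [], []
    for p in perms:
        y_p = [y[k] for k in p]
        scores = [abs_cov(x, y_p) for x in x_cell]
        rho_v.append(scores[v])
        rho_g.append(max(scores))               # cell score (illustrative choice)
    m = len(perms)
    # T_g at every group element: fraction of the orbit whose cell score is at
    # least as large (this uses the invariance of the uniform measure).
    t_g = [sum(1 for r in rho_g if r >= r0) / m for r0 in rho_g]
    keep = [i for i in range(m) if t_g[i] <= theta_g]
    n_g = len(keep) / m
    t_v = sum(1 for r in rho_v if r >= rho_v[0]) / m
    if keep:
        t_v_cond = sum(1 for i in keep if rho_v[i] >= rho_v[0]) / len(keep)
    else:
        t_v_cond = float("nan")
    return t_g[0], t_v, n_g, t_v_cond


y = [3, 1, 4, 1, 5]
x_cell = [[1, 2, 3, 4, 5], [2, 7, 1, 8, 2]]
t_g, t_v, n_g, t_v_cond = exact_scores(y, x_cell, v=0, theta_g=0.2)
print(t_g, t_v, n_g, t_v_cond)
```

The enumeration has $n!$ terms, which is why exact evaluation is impractical beyond toy sizes and Monte-Carlo sampling of permutations is used in practice.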

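With this notation, we let

$$\hat{A} = \{v : T_{g(v)}(\mathbf{U}) \le \theta_G \text{ and } T_v^{\theta_G}(\mathbf{U}) \le \theta_V \text{ and } T_v(\mathbf{U}) \le \theta'_V\},$$

which depends on the choice of three constants, $\theta_G$, $\theta_V$ and $\theta'_V$. We then
have:

Theorem 4.1. For all $v \in V_0$:

$$\mathbb{P}(v \in \hat{A}) \le \theta'_V \tag{9}$$

and for all $v \in V_{00}$,

$$\mathbb{P}(v \in \hat{A}) \le \theta_G \theta_V. \tag{10}$$

This result tells us how to control the FWER for a two-level permutation
test based on any scores in the (generally intractable) case in which we can
exactly compute the test statistics, when we declare an index $v$ active if and
only if $T_{g(v)}(\mathbf{U}) \le \theta_G$ and $T_v^{\theta_G}(\mathbf{U}) \le \theta_V$ and $T_v(\mathbf{U}) \le \theta'_V$.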

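Proof. For (9), we use a standard argument justifying randomization
tests, that we provide here for completeness. If $v \in V_0$, we have

$$\mathbb{P}(v \in \hat{A}) = \mathbb{P}\big(T_{g(v)}(\mathbf{U}) \le \theta_G;\ T_v^{\theta_G}(\mathbf{U}) \le \theta_V;\ T_v(\mathbf{U}) \le \theta'_V\big) \le \mathbb{P}\big(T_v(\mathbf{U}) \le \theta'_V\big).$$

From the invariance assumption, we have

$$\begin{aligned}
\mathbb{P}\big(T_v(\mathbf{U}) \le \theta'_V\big) &= \mathbb{P}\big(T_v(\xi \odot \mathbf{U}) \le \theta'_V\big) \quad \text{for all } \xi \in \mathfrak{S} \\
&= \int_{\mathfrak{S}} \mathbb{P}\big(T_v(\xi \odot \mathbf{U}) \le \theta'_V\big)\, d\mu(\xi) \\
&= \mathbb{E}\big(\mu(\xi : T_v(\xi \odot \mathbf{U}) \le \theta'_V)\big).
\end{aligned}$$

It now remains to remark that $T_v(\xi \odot \mathbf{U}) = 1 - F_\zeta^-(\zeta(\xi))$ where $\zeta$ is the
random variable on $\mathfrak{S}$ defined by $\zeta(\xi') = \rho_v(\xi' \odot \mathbf{U})$, so that, by Lemma 4.1,

$$\mu\big(\xi : T_v(\xi \odot \mathbf{U}) \le \theta'_V\big) = \mu\big(\xi : F_\zeta^-(\zeta(\xi)) \ge 1 - \theta'_V\big) \le \theta'_V,$$

which proves (9).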

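Let us now prove (10), assuming $v \in V_{00}$ and letting $g = g(v)$. We write

$$\mathbb{P}(v \in \hat{A}) \le \mathbb{P}\big(T_g(\mathbf{U}) \le \theta_G;\ T_v^{\theta_G}(\mathbf{U}) \le \theta_V\big)$$

and find an upper bound for the right-hand side of the inequality. Using the
invariance assumption, we have, for all $\xi \in \mathfrak{S}$,

$$\begin{aligned}
\mathbb{P}\big(T_g(\mathbf{U}) \le \theta_G;\ T_v^{\theta_G}(\mathbf{U}) \le \theta_V\big) &= \mathbb{P}\big(T_g(\xi \odot \mathbf{U}) \le \theta_G;\ T_v^{\theta_G}(\xi \odot \mathbf{U}) \le \theta_V\big) \\
&= \int_{\mathfrak{S}} \mathbb{P}\big(T_g(\xi' \odot \mathbf{U}) \le \theta_G;\ T_v^{\theta_G}(\xi' \odot \mathbf{U}) \le \theta_V\big)\, d\mu(\xi') \\
&= \mathbb{E}\big(\mu(\xi' : T_g(\xi' \odot \mathbf{U}) \le \theta_G;\ T_v^{\theta_G}(\xi' \odot \mathbf{U}) \le \theta_V)\big).
\end{aligned}$$

Notice that, since $\mu$ is right-invariant, we have $N_g^{\theta_G}(\xi' \odot \mathbf{U}) = N_g^{\theta_G}(\mathbf{U})$ and

$$\begin{aligned}
T_v^{\theta_G}(\xi' \odot \mathbf{U}) &= \frac{1}{N_g^{\theta_G}(\xi' \odot \mathbf{U})}\, \mu\big(\xi : \rho_v(\xi' \odot \mathbf{U}) \le \rho_v(\xi \odot \xi' \odot \mathbf{U});\ T_g(\xi \odot \xi' \odot \mathbf{U}) \le \theta_G\big) \\
&= \frac{1}{N_g^{\theta_G}(\mathbf{U})}\, \mu\big(\xi : \rho_v(\xi' \odot \mathbf{U}) \le \rho_v(\xi \odot \mathbf{U});\ T_g(\xi \odot \mathbf{U}) \le \theta_G\big).
\end{aligned}$$

Let $\tilde{\mu}$ denote the probability $\mu$ conditional to the event $T_g(\xi \odot \mathbf{u}) \le \theta_G$
($\mathbf{u}$ being fixed). Then

$$\frac{1}{N_g^{\theta_G}(\mathbf{u})}\, \mu\big(\xi' : T_g(\xi' \odot \mathbf{u}) \le \theta_G;\ T_v^{\theta_G}(\xi' \odot \mathbf{u}) \le \theta_V\big) = \tilde{\mu}\big(\xi' : p(\xi') \le \theta_V\big),$$

where

$$p(\xi') = \tilde{\mu}\big(\xi : \rho_v(\xi \odot \mathbf{u}) \ge \rho_v(\xi' \odot \mathbf{u})\big).$$

Hence, Lemma 4.1 implies that, for each $\mathbf{u}$:

$$\frac{1}{N_g^{\theta_G}(\mathbf{u})}\, \mu\big(\xi' : T_g(\xi' \odot \mathbf{u}) \le \theta_G;\ T_v^{\theta_G}(\xi' \odot \mathbf{u}) \le \theta_V\big) \le \theta_V.$$

Hence,

$$\mathbb{P}\big(T_g(\mathbf{U}) \le \theta_G;\ T_v^{\theta_G}(\mathbf{U}) \le \theta_V\big) \le \mathbb{E}\big(N_g^{\theta_G}(\mathbf{U})\, \theta_V\big) = \theta_V\, \mathbb{E}\big(N_g^{\theta_G}(\mathbf{U})\big).$$

Applying Lemma 4.1 to the random variable $\xi \mapsto \rho_g(\xi \odot \mathbf{U})$ for the probability
distribution $\mu$, we immediately get $N_g^{\theta_G}(\mathbf{U}) \le \theta_G$ so that

$$\mathbb{P}\big(T_g(\mathbf{U}) \le \theta_G;\ T_v^{\theta_G}(\mathbf{U}) \le \theta_V\big) \le \theta_G \theta_V.$$

□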


As an immediate corollary, we have: 

Corollary 4.1.

\[ \mathrm{FWER}(\hat A) \le |V|\, \theta_G \theta_V + J \nu_G\, \theta'_V. \]

Remark 4.1 (Continuous approximation). Even though $\mathfrak S$ is finite, it
is a huge set in typical applications, and while Lemma 4.1 only provides
inequalities for discrete distributions, we can safely ignore the discontinuity
in practice and work as if the distributions to which we applied this lemma
were continuous. Doing so, it is easy to convince oneself, by inspecting the
previous proof, that our estimates become (for $v \in V_{00}$, $g = g(v)$)

\[ P(T_g(U) \le \theta_G;\ T_v^{\theta_G}(U) \le \theta_V) = \theta_G \theta_V. \]

This implies that

\[ P(T_v^{\theta_G}(U) \le \theta_V \mid T_g(U) \le \theta_G) = \theta_V, \]

because we also have:

\[ P(T_g(U) \le \theta_G) = \theta_G. \]

This tells us that, conditional to $T_g(U) \le \theta_G$, $T_v^{\theta_G}(U)$ is uniformly distributed
on $[0,1]$.

As mentioned above, this result does not have practical interest since 
it requires applying all possible permutations to the data. In practice, a 
random subset of permutations is picked instead, and we will develop the 
related theory in the next section (using these inequalities as intermediary 
results in our proofs).

4.3. Finite resampling scores. We now replace $T_g$, $T_v^{\theta_G}$ and $T_v$ with
Monte-Carlo estimates and describe how the upper bounds in Theorem 4.1
need to be modified.
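
As a rough illustration of what such a Monte-Carlo estimate looks like in the
permutation setting, the sketch below estimates a single p-value-type statistic
as the fraction of $K$ randomly drawn permutations whose score is at least as
large as the observed one. The function names (`abs_cov`, `mc_pvalue`), the
choice of an absolute-covariance score, and the convention of permuting the
response only are assumptions made for this example; they are not taken from
the paper.

```python
import random


def abs_cov(y, x):
    """Illustrative score: absolute (unnormalized) covariance of y and x."""
    n = len(y)
    mean_y = sum(y) / n
    mean_x = sum(x) / n
    return abs(sum((a - mean_y) * (b - mean_x) for a, b in zip(y, x)))


def mc_pvalue(y, x, score, num_perm, rng):
    """Fraction of num_perm random permutations of y whose score is at
    least as large as the observed score (empirical-measure estimate)."""
    observed = score(y, x)
    hits = 0
    for _ in range(num_perm):
        y_perm = list(y)
        rng.shuffle(y_perm)          # one random group element applied to the data
        if observed <= score(y_perm, x):
            hits += 1
    return hits / num_perm


if __name__ == "__main__":
    rng = random.Random(0)
    x = list(range(20))
    y = [2 * value for value in x]   # strongly associated toy data
    print(mc_pvalue(y, x, abs_cov, 200, rng))
```

The estimate is a binomial proportion given the data, which is why the bounds
of this section pay an additional price depending on $K$.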

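For a positive integer $K$, we let $\mu^{\otimes K}$ denote the $K$-fold product measure on
$\mathfrak S^K$, whose realizations are $K$ independent group elements
$\xi = (\xi_1, \ldots, \xi_K) \in \mathfrak S^K$. We will use the notation $\mathbb P$ and $\mathbb E$ to denote
probability or expectation for the joint distribution of $U$ and $\xi$
(i.e., $\mathbb P = P \otimes \mu^{\otimes K}$). We will also denote by $\hat\mu_\xi$ the empirical measure

\[ \hat\mu_\xi = \frac{1}{K} \sum_{k=1}^{K} \delta_{\xi_k}. \]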

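With this notation, we let

\[ \hat T_g(u, \xi) = \hat\mu_\xi(\xi' : \rho_g(u) \le \rho_g(\xi' \odot u)) = \frac{1}{K} \sum_{k=1}^{K} \mathbf 1_{\rho_g(u) \le \rho_g(\xi_k \odot u)}, \]

\[ \hat T_v(u, \xi) = \hat\mu_\xi(\xi' : \rho_v(u) \le \rho_v(\xi' \odot u)) \]

and

\[ \hat T_v^{\theta_G, \varepsilon_G}(u, \xi) = \frac{1}{N_g^{\theta_G + \varepsilon_G}(u)}\, \hat\mu_\xi(\xi' : \rho_v(u) \le \rho_v(\xi' \odot u);\ \hat T_g(\xi' \odot u, \xi) \le \theta_G + 2\varepsilon_G), \]

where

\[ N_g^{\theta_G + \varepsilon_G}(u) = \mu(\xi' : T_g(\xi' \odot u) \le \theta_G + \varepsilon_G). \]

We can now define

\[ \hat A = \{v : \hat T_{g(v)}(U, \xi) \le \theta_G \text{ and } \hat T_v^{\theta_G, \varepsilon_G}(U, \xi) \le \theta_V \text{ and } \hat T_v(U, \xi) \le \theta'_V\} \]

and state: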


Theorem 4.2. Making the continuous approximation described in Remark 4.1,
the following holds. For $v \in V_0$,

\[ \mathbb P(v \in \hat A) \le \theta'_V, \tag{11} \]

and, for $v \in V_{00}$ and $g = g(v)$,

\[ \mathbb P(v \in \hat A) \le (K+1) \exp\Big(-2(K-1)\Big(\varepsilon_G - \frac{1}{K-1}\Big)^2\Big) + \frac{K(\theta_G + \varepsilon_G)\theta_V + 1}{K+1}. \tag{12} \]

Corollary 4.2. The FWER for the randomized test is controlled by

\[ \mathrm{FWER} \le |V| \exp(-2K\varepsilon_G^2) + |V| K \exp\Big(-2(K-1)\Big(\varepsilon_G - \frac{1}{K-1}\Big)^2\Big) + |V|\, \frac{K(\theta_G + \varepsilon_G)\theta_V + 1}{K+1} + J \nu_G\, \theta'_V. \]
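
To get a feeling for the price paid for finite resampling, one can evaluate the
two bounds numerically. The sketch below assumes that Corollary 4.1 reads
$|V|\theta_G\theta_V + J\nu_G\theta'_V$ and that Corollary 4.2 is the sum of two Hoeffding terms,
the term $(K(\theta_G+\varepsilon_G)\theta_V+1)/(K+1)$ (each multiplied by $|V|$) and $J\nu_G\theta'_V$; the
function and argument names are hypothetical and introduced only for this
illustration.

```python
import math


def fwer_bound_exact(n_idx, j_cells, nu_g, theta_g, theta_v, theta_v_prime):
    """Bound for exactly computed statistics: |V| th_G th_V + J nu_G th'_V."""
    return n_idx * theta_g * theta_v + j_cells * nu_g * theta_v_prime


def fwer_bound_resampled(n_idx, j_cells, nu_g, theta_g, theta_v,
                         theta_v_prime, eps_g, num_perm):
    """Bound for K = num_perm random group elements (assumed reading of
    Corollary 4.2; requires num_perm >= 2)."""
    k = num_perm
    hoeffding_cell = math.exp(-2.0 * k * eps_g ** 2)
    hoeffding_all = k * math.exp(-2.0 * (k - 1) * (eps_g - 1.0 / (k - 1)) ** 2)
    resampling = (k * (theta_g + eps_g) * theta_v + 1.0) / (k + 1.0)
    per_index = hoeffding_cell + hoeffding_all + resampling
    return n_idx * per_index + j_cells * nu_g * theta_v_prime


if __name__ == "__main__":
    args = (1000, 2, 10, 0.01, 0.001, 0.0001)
    print(fwer_bound_exact(*args))
    for k in (10 ** 4, 10 ** 5, 10 ** 6):
        print(k, fwer_bound_resampled(*args, eps_g=0.01, num_perm=k))
```

The Hoeffding terms decay exponentially in $K\varepsilon_G^2$, so in this reading $\varepsilon_G$ has to be
large compared with $K^{-1/2}$ for the resampled bound to approach the exact one.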


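Proof of Theorem 4.2. We start with (11), which is simpler and standard.
Let $v \in V_0$. Conditionally to $U$, $K \hat T_v(U, \xi)$ follows a binomial
distribution $\mathrm{Bin}(K, T_v(U))$ (with $T_v$ defined by equation (6)), so that

\[ \mathbb P(\hat T_v(U, \xi) \le \theta'_V \mid U) = \sum_{k=0}^{[K\theta'_V]} \binom{K}{k}\, T_v(U)^k (1 - T_v(U))^{K-k}, \]

where $[\cdot]$ denotes the integral part. The continuous approximation implies
that $T_v(U)$ is uniformly distributed, so that

\[ E\big(T_v(U)^k (1 - T_v(U))^{K-k}\big) = \int_0^1 t^k (1-t)^{K-k}\, dt = \frac{k!\,(K-k)!}{(K+1)!}, \]

yielding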

P{fy < 9'y) = < e'y. 


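We now consider (12) and take $v \in V_{00}$, $g = g(v)$. We need to prove that:

\begin{align*}
\mathbb P(\hat T_g(U, \xi) \le \theta_G;\ \hat T_v^{\theta_G, \varepsilon_G}(U, \xi) \le \theta_V)
\le{}& \exp(-2K\varepsilon_G^2) + K \exp\Big(-2(K-1)\Big(\varepsilon_G - \frac{1}{K-1}\Big)^2\Big) \\
&+ \frac{[K(\theta_G + \varepsilon_G)\theta_V] + 1}{K+1}.
\end{align*}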

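Given $U$, $\hat T_g(U, \xi)$ is the empirical mean of $K$ i.i.d. Bernoulli random
variables $\mathbf 1_{\rho_g(U) \le \rho_g(\xi_k \odot U)}$, $k = 1, \ldots, K$, with success probability $T_g(U)$.
Hoeffding's inequality [12] implies that

\[ \mathbb P(\hat T_g(U, \xi) < T_g(U) - \varepsilon_G \mid U) \le \exp(-2K\varepsilon_G^2). \]

So:

\begin{align*}
\mathbb P(\hat T_g(U, \xi) \le \theta_G;\ \hat T_v^{\theta_G, \varepsilon_G}(U, \xi) \le \theta_V)
&= E\big(\mathbb P(\hat T_g(U, \xi) \le \theta_G;\ \hat T_v^{\theta_G, \varepsilon_G}(U, \xi) \le \theta_V \mid U)\big) \\
&\le E\big(\mathbb P(T_g(U) \le \theta_G + \varepsilon_G;\ \hat T_v^{\theta_G, \varepsilon_G}(U, \xi) \le \theta_V \mid U)\big) + \exp(-2K\varepsilon_G^2) \\
&= \mathbb P(T_g(U) \le \theta_G + \varepsilon_G;\ \hat T_v^{\theta_G, \varepsilon_G}(U, \xi) \le \theta_V) + \exp(-2K\varepsilon_G^2). \tag{13}
\end{align*}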

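We now fix $i_0$ in $\{1, \ldots, K\}$ and consider

\[ \mathbb P(\hat T_g(\xi_{i_0} \odot U, \xi) > T_g(\xi_{i_0} \odot U) + \varepsilon_G \mid U, \xi_{i_0}). \]

Since this probability does not depend on which $i_0$ is chosen, we will estimate
it, without loss of generality, for $i_0 = 1$, which will simplify the notation.
For this, notice that

\[ \hat T_g(\xi_1 \odot U, \xi) = \frac{1}{K} + \frac{1}{K} \sum_{i=2}^{K} \mathbf 1_{\rho_g(\xi_i \odot U) \ge \rho_g(\xi_1 \odot U)}, \]

and also that

\[ \mathbb E\big(\mathbf 1_{\rho_g(\xi_i \odot U) \ge \rho_g(\xi_1 \odot U)} \mid U, \xi_1\big) = \mu(\xi : \rho_g(\xi \odot U) \ge \rho_g(\xi_1 \odot U)) = T_g(\xi_1 \odot U). \]

Now we use Hoeffding's inequality to obtain

\[ \mathbb P(\hat T_g(\xi_1 \odot U, \xi) > T_g(\xi_1 \odot U) + \varepsilon_G \mid U, \xi_1) \le \exp\Big(-2(K-1)\Big(\varepsilon_G - \frac{1}{K-1}\Big)^2\Big). \]

The same upper-bound applies to $\mathbb P(\hat T_g(\xi_1 \odot U, \xi) > T_g(\xi_1 \odot U) + \varepsilon_G \mid U)$
by taking the expectation with respect to $\xi_1$, and one deduces from this that

\begin{align*}
\mathbb P(\exists i : \hat T_g(\xi_i \odot U, \xi) > T_g(\xi_i \odot U) + \varepsilon_G)
&= E\big(\mathbb P(\exists i : \hat T_g(\xi_i \odot U, \xi) > T_g(\xi_i \odot U) + \varepsilon_G \mid U)\big) \\
&\le K \exp\Big(-2(K-1)\Big(\varepsilon_G - \frac{1}{K-1}\Big)^2\Big).
\end{align*}

Introduce

\[ \tilde T_v^{\theta_G, \varepsilon_G}(u, \xi) = \frac{1}{\theta_G + \varepsilon_G}\, \hat\mu_\xi(\xi' : \rho_v(u) \le \rho_v(\xi' \odot u);\ T_g(\xi' \odot u) \le \theta_G + \varepsilon_G). \]

On the event

\[ C = \big(\hat T_g(\xi_i \odot U, \xi) \le T_g(\xi_i \odot U) + \varepsilon_G,\ i = 1, \ldots, K\big) \]

one has

\begin{align*}
&\hat\mu_\xi(\xi' : \rho_v(U) \le \rho_v(\xi' \odot U);\ T_g(\xi' \odot U) \le \theta_G + \varepsilon_G) \\
&\qquad \le \hat\mu_\xi(\xi' : \rho_v(U) \le \rho_v(\xi' \odot U);\ \hat T_g(\xi' \odot U, \xi) \le \theta_G + 2\varepsilon_G).
\end{align*}

Moreover, we have $N_g^{\theta_G + \varepsilon_G}(U) \le \theta_G + \varepsilon_G$, by applying Lemma 4.1 to the
random variable $\xi \mapsto \rho_g(\xi \odot U)$ under the distribution $\mu$. This implies
that, on $C$, we have

\[ \tilde T_v^{\theta_G, \varepsilon_G}(U, \xi) \le \hat T_v^{\theta_G, \varepsilon_G}(U, \xi), \]

which implies

\begin{align*}
\mathbb P(T_g(U) \le \theta_G + \varepsilon_G;\ \hat T_v^{\theta_G, \varepsilon_G}(U, \xi) \le \theta_V)
\le{}& K \exp\Big(-2(K-1)\Big(\varepsilon_G - \frac{1}{K-1}\Big)^2\Big) \\
&+ \mathbb P(T_g(U) \le \theta_G + \varepsilon_G;\ \tilde T_v^{\theta_G, \varepsilon_G}(U, \xi) \le \theta_V). \tag{14}
\end{align*}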

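We now provide an estimate for the last term in (14). We have

\begin{align*}
\mathbb P(T_g(U) \le \theta_G + \varepsilon_G;\ \tilde T_v^{\theta_G, \varepsilon_G}(U, \xi) \le \theta_V)
&= \mathbb P(\tilde T_v^{\theta_G, \varepsilon_G}(U, \xi) \le \theta_V \mid T_g(U) \le \theta_G + \varepsilon_G)\, P(T_g(U) \le \theta_G + \varepsilon_G) \\
&\le (\theta_G + \varepsilon_G) \times \mathbb P(\tilde T_v^{\theta_G, \varepsilon_G}(U, \xi) \le \theta_V \mid T_g(U) \le \theta_G + \varepsilon_G). \tag{15}
\end{align*}

Now write

\begin{align*}
&\mathbb P(\tilde T_v^{\theta_G, \varepsilon_G}(U, \xi) \le \theta_V \mid T_g(U) \le \theta_G + \varepsilon_G) \\
&\quad = \mathbb P\big(\hat\mu_\xi(\xi' : \rho_v(U) \le \rho_v(\xi' \odot U);\ T_g(\xi' \odot U) \le \theta_G + \varepsilon_G) \le \theta_V(\theta_G + \varepsilon_G) \,\big|\, T_g(U) \le \theta_G + \varepsilon_G\big) \\
&\quad = E\Big(\mathbb P\big(\hat\mu_\xi(\xi' : \rho_v(U) \le \rho_v(\xi' \odot U);\ T_g(\xi' \odot U) \le \theta_G + \varepsilon_G) \le \theta_V(\theta_G + \varepsilon_G) \,\big|\, U\big) \,\Big|\, T_g(U) \le \theta_G + \varepsilon_G\Big).
\end{align*}

Given $U$, the probability (for $\mu$) of the event

\[ \big(\rho_v(U) \le \rho_v(\xi' \odot U);\ T_g(\xi' \odot U) \le \theta_G + \varepsilon_G\big) \]

is $p(U) = N_g^{\theta_G + \varepsilon_G}(U)\, T_v^{\theta_G + \varepsilon_G}(U)$ and

\[ K \hat\mu_\xi(\xi' : \rho_v(U) \le \rho_v(\xi' \odot U);\ T_g(\xi' \odot U) \le \theta_G + \varepsilon_G) \]

follows a binomial distribution $\mathrm{Bin}(K, p(U))$. We make the continuous
approximation (Remark 4.1) to write $N_g^{\theta_G + \varepsilon_G}(U) = \theta_G + \varepsilon_G$, and to use
the fact that $T_v^{\theta_G + \varepsilon_G}(U)$ is uniformly distributed on $[0,1]$ conditionally to
$T_g(U) \le \theta_G + \varepsilon_G$, so that $p(U)$ is uniformly distributed on $[0, \theta_G + \varepsilon_G]$ given
$T_g(U) \le \theta_G + \varepsilon_G$. This implies

\begin{align*}
\mathbb P(\tilde T_v^{\theta_G, \varepsilon_G}(U, \xi) \le \theta_V \mid T_g(U) \le \theta_G + \varepsilon_G)
&= \frac{1}{\theta_G + \varepsilon_G} \sum_{k=0}^{[K(\theta_G + \varepsilon_G)\theta_V]} \int_0^{\theta_G + \varepsilon_G} \binom{K}{k} t^k (1-t)^{K-k}\, dt \\
&\le \frac{1}{(K+1)(\theta_G + \varepsilon_G)} \sum_{k=0}^{[K(\theta_G + \varepsilon_G)\theta_V]} \int_0^1 (K+1) \binom{K}{k} t^k (1-t)^{K-k}\, dt \\
&= \frac{[K(\theta_G + \varepsilon_G)\theta_V] + 1}{(K+1)(\theta_G + \varepsilon_G)},
\end{align*}

since each of the integrals is equal to 1 (they are densities of Beta
distributions). From (15), we find

\[ \mathbb P(T_g(U) \le \theta_G + \varepsilon_G;\ \tilde T_v^{\theta_G, \varepsilon_G}(U, \xi) \le \theta_V) \le \frac{[K(\theta_G + \varepsilon_G)\theta_V] + 1}{K+1}, \]

which, combined with equations (13) and (14), completes the proof of the
theorem.

□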


4.4. Simulations. For the simulations, we generated data according to
the parametric model described in section 3, and compared the detection
rate using the non-parametric approach to the one obtained with thresholds
optimized as described in section 3. We fixed the ratio between the
non-parametric thresholds $\theta_G$ and $\theta_V$ to coincide with the ratio between the
type I errors at the cell and index levels in the parametric case, i.e.,

\[ \frac{\theta_V}{\theta_G} = \frac{P(T_v^{\mathrm{par}} > \theta_V^{\mathrm{par}})}{P(T_g^{\mathrm{par}} > \theta_G^{\mathrm{par}})} \]

for $v \in V_{00}$, where $T_g^{\mathrm{par}}, T_v^{\mathrm{par}}$ and $\theta_G^{\mathrm{par}}, \theta_V^{\mathrm{par}}$ are the scores and thresholds
designed for the parametric case. Fixing $\theta_V/\theta_G$ and the FWER uniquely
determines the thresholds.

Fig 3. Comparison of the coarse-to-fine and Bonferroni-Holm methods in the parametric
and non-parametric cases when data is generated from the parametric model.

Figure 3 provides a comparison between the coarse-to-fine approaches 
(parametric and non-parametric) and Bonferroni-Holm. In figures 4 and 5, 
the non-parametric coarse-to-fine method is compared with Bonferroni-Holm 
for two different index-level effect sizes. Using the notation of section 3, the 
cell-level effect size is defined by 

effect size(g) = a^a1la\ 

v&g 

while effect size(u) = at the index level. The range of index-level 

effect sizes we consider in these simulations is similar to the one observed in
typical genome-wide association studies.

Fig 4. Comparison of the non-parametric Bonferroni-Holm and coarse-to-fine methods
for an index-level effect size equal to 1/900. When the clustering
assumption is true, the coarse-to-fine method outperforms the Bonferroni-Holm approach.

Fig 5. Comparison of the non-parametric Bonferroni-Holm and coarse-to-fine methods
for an index-level effect size equal to 1.5/900. A larger effect size (compared to figure 4)
improves the performance of the Bonferroni-Holm method. The coarse-to-fine method
performs better at levels that correspond to two or more active indices per cell.

5. Estimating the number of active cells.

5.1. Notations, assumptions and starting point. We now focus on the
issue of estimating the number of active cells, $J$, from observed data, since
this number enters our FWER estimates. We use a method inspired
by [18, 17] for the estimation of false discovery rates, adapted to our
context. Our estimation will be based on the cell statistics $(T_g, g \in G)$
under the following setting.

A1. If $g \cap A = \emptyset$ ($g$ is inactive), then $T_g$ is uniformly distributed.

A2. If $g_1, g_2$ are inactive, then $T_{g_1}$ and $T_{g_2}$ are independent.

These assumptions will be justified in section 5.3. We will also assume that
$T_g$ takes large values when $g$ is active, so that, for a suitable non-conservative
threshold $t_0$, we have $P(T_g > t_0) \simeq 1$. To simplify the argument, we will
actually make the approximation that:

A3. There exists $t_0 \in (0,1)$ such that $P(T_g > t_0) = 1$ if $g \cap A \neq \emptyset$.

For $t \in [0,1]$, we define $D_t = \{g : T_g < t\}$. Let $D$ be the set of active cells,
and $G_0 = G \setminus D$. Note that $J = |D|$. Then, for $t > t_0$,

\[ E(|D_t|) = E\Big(\sum_{g \in G_0} \mathbf 1_{g \in D_t}\Big) + E\Big(\sum_{g \in D} \mathbf 1_{g \in D_t}\Big) = (|G| - J)t + J. \]

We therefore have, for $t > t_0$,

\[ |D_t| = (|G| - J)t + J + Z_t, \]

where $Z_t$ is a centered random variable. The following proposition states
that the process $Z_t / \sqrt{|G| - J}$, for $t > t_0$, has the covariance structure of a
Brownian bridge.

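Proposition 5.1. Under assumptions A1 to A3, we have, for $t_1, t_2 > t_0$,

\[ \mathrm{cov}(Z_{t_1}, Z_{t_2}) = (|G| - J)(\min(t_1, t_2) - t_1 t_2). \]

Proof. Since $Z_t = \sum_{g \in G_0} (\mathbf 1_{g \in D_t} - t)$, we have

\[ \mathrm{cov}(Z_{t_1}, Z_{t_2}) = \sum_{g_1, g_2 \in G_0} \mathrm{cov}(\mathbf 1_{g_1 \in D_{t_1}}, \mathbf 1_{g_2 \in D_{t_2}}). \tag{17} \]

If $g_1 \neq g_2$, then $\mathrm{cov}(\mathbf 1_{g_1 \in D_{t_1}}, \mathbf 1_{g_2 \in D_{t_2}}) = 0$, and for $g_1 = g_2\ (= g)$:

\begin{align*}
\mathrm{cov}(\mathbf 1_{g \in D_{t_1}}, \mathbf 1_{g \in D_{t_2}})
&= E(\mathbf 1_{g \in D_{t_1}} \mathbf 1_{g \in D_{t_2}}) - E(\mathbf 1_{g \in D_{t_1}})\, E(\mathbf 1_{g \in D_{t_2}}) \\
&= \min(t_1, t_2) - t_1 t_2.
\end{align*}

Finally, from (17), we get

\[ \mathrm{cov}(Z_{t_1}, Z_{t_2}) = (|G| - J)(\min(t_1, t_2) - t_1 t_2), \]

which concludes the proof.

□

We now make a Gaussian approximation for large $|G|$ of the vector
$\zeta / \sqrt{|G| - J}$, with $\zeta = (Z_{t_1}, \ldots, Z_{t_k})$ and $t_0 < t_1 < \cdots < t_k < 1$.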


Proposition 5.2. Using the previous notation,

\[ \frac{\zeta}{\sqrt{|G| - J}} \longrightarrow \mathcal N(0, \Gamma) \]

in distribution when $|G|$ diverges to infinity, where $\Gamma$ is the covariance
matrix with entries:

\[ \Gamma(i, j) = \min(t_i, t_j) - t_i t_j. \]

Proof. Our assumptions ensure that $\zeta$ satisfies a central limit theorem
conditionally to $Y$, with a limit, $\mathcal N(0, \Gamma)$, that is independent of the value of
$Y$. This implies that the limit is also unconditional.

□


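We are now able to present our principal result, which provides a
high-probability upper bound for $J$.

Theorem 5.1. Let $Z \sim \mathcal N(0,1)$ be a randomization variable, independent
of $(T_g, g \in G)$. For $i \in \{1, \ldots, n\}$ and $C > 0$, define

\[ H_i(C) = \left( \frac{-(t_i Z + C) + \sqrt{(t_i Z + C)^2 + 4(1 - t_i)(|G| - |D_{t_i}|)}}{2(1 - t_i)} \right)^2. \]

Then

\[ \mathbb P\Big(J > |G| - \max_i H_i(C)\Big) \longrightarrow \eta(C), \]

where $\eta(C) \le \exp\big(-C^2 / (2 t_n)\big)$.

As a consequence, given $\varepsilon > 0$, let $C_\varepsilon = \sqrt{-2 t_n \log(\varepsilon)}$. Then

\[ \hat J_\varepsilon = |G| - \max_i H_i(C_\varepsilon) \tag{18} \]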

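is such that $\mathbb P(J > \hat J_\varepsilon) \le \varepsilon$ (we use a randomly sampled value of $Z$). Let us
now prove this result.

Proof. We know from Proposition 5.2 that the vector

\[ \frac{\zeta}{\sqrt{|G| - J}} + (t_1, \ldots, t_k)\, Z \longrightarrow B + (t_1, \ldots, t_k)\, Z \]

in distribution, where $B \sim \mathcal N(0, \Gamma)$. Then, since min is a continuous function
on $\mathbb R^n$, we deduce that:

\[ \min_i \Big( \frac{\zeta_i}{\sqrt{|G| - J}} + t_i Z \Big) \longrightarrow \min_i (B_i + t_i Z), \]

but:

\[ P\Big(\min_i (B_i + t_i Z) < -C\Big) = P\Big(\max_i (-B_i - t_i Z) > C\Big). \]

The process $M_i = -B_i - t_i Z$, $i = 1, \ldots, n$, is a martingale and $\exp(\lambda M_i)$ is a
submartingale for all $\lambda > 0$. Applying Doob's inequality, then optimizing
over $\lambda$, finally gives:

\[ P\Big(\min_i (B_i + t_i Z) < -C\Big) \le \exp\Big(-\frac{C^2}{2 t_n}\Big). \]

It remains to prove that

\[ P\Big(J > |G| - \max_i H_i(C)\Big) = P\Big(\min_i \Big( \frac{Z_{t_i}}{\sqrt{|G| - J}} + t_i Z \Big) < -C\Big). \]

But:

\[ \frac{Z_{t_i}}{\sqrt{|G| - J}} = \frac{|D_{t_i}| - (|G| - J) t_i - J}{\sqrt{|G| - J}}. \]

Then:

\[ \min_i \Big( \frac{\zeta_i}{\sqrt{|G| - J}} + t_i Z \Big) < -C \iff \min_i \Big( \frac{|D_{t_i}| - (|G| - J) t_i - J}{\sqrt{|G| - J}} + t_i Z \Big) < -C. \]

Solving the quadratic inequality for $\sqrt{|G| - J}$, one finds that

\[ \frac{|D_{t_i}| - (|G| - J) t_i - J}{\sqrt{|G| - J}} + t_i Z < -C \]

is equivalent to $\sqrt{|G| - J} < \sqrt{H_i(C)}$, which completes the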

proof. □

5.2. Application to the coarse-to-fine algorithm. The previous section
provided us with an estimator $\hat J = \hat J_\varepsilon$ in (18) such that $\hat J \ge J$ with
probability larger than $1 - \varepsilon$, which implies that

\[ \mathrm{FWER}(\hat A) \le |V|\, p_{00}(\theta_G, \theta_V) + \nu_G \hat J\, p_0(\theta_V), \]

with probability $1 - \varepsilon$ at least.

We previously chose constants $\theta_G$ and $\theta_V$ by optimizing the detection rate
on a well-chosen alternative hypothesis subject to the upper-bound being
less than a significance level $\alpha$. This was done using a deterministic
upper-bound of $J$, but cannot be directly applied with a data-based estimation of $J$
since this would yield data-dependent constants $\theta_G$ and $\theta_V$, which cannot be
plugged into the definition of the set $\hat A$ without invalidating our estimation of
the FWER. In other terms, if, for a fixed number $J'$, one defines $\hat A_{J'}$ to be the
discovery set obtained by optimizing $\theta_G$ and $\theta_V$ subject to
$|V|\, p_{00}(\theta_G, \theta_V) + J' p_0(\theta_V) \le \alpha$, our previous results imply that
$\mathrm{FWER}(\hat A_{J'}) \le \alpha$ for all $J' \ge J$, but not necessarily that
$\mathrm{FWER}(\hat A_{\hat J}) \le \alpha + \varepsilon$.

A simple way to address this issue is to replace $\hat A_{\hat J}$ with

\[ \tilde A = \bigcap_{J' \le \hat J} \hat A_{J'}. \]

Because $\tilde A \subset \hat A_J$ with probability at least $1 - \varepsilon$, we have

\[ \mathrm{FWER}(\tilde A) = P(\tilde A \cap V_0 \neq \emptyset) \le P(\hat A_J \cap V_0 \neq \emptyset) + \varepsilon = \mathrm{FWER}(\hat A_J) + \varepsilon, \]

so that $\tilde A$ controls the FWER at level $\alpha + \varepsilon$ as intended.
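
As an illustration of how the estimator in (18) could be computed from the
cell statistics, the following sketch assumes that $H_i(C)$ is the squared
positive root of the quadratic inequality solved in the proof of Theorem 5.1,
that $C_\varepsilon = \sqrt{-2 t_n \log \varepsilon}$ with $t_n$ the largest grid point, and that
$D_t = \{g : T_g < t\}$. The function name `estimate_active_cells` and its
arguments are hypothetical; the randomization variable $Z$ can be passed
explicitly to make a run reproducible.

```python
import math
import random


def estimate_active_cells(cell_stats, t_grid, eps, z=None, rng=None):
    """High-probability upper estimate of the number J of active cells.

    cell_stats: statistics T_g, one per cell (uniform for inactive cells).
    t_grid:     thresholds t_1 < ... < t_n, all in (t_0, 1).
    eps:        target probability that the estimate falls below J.
    z:          value of the N(0, 1) randomization variable (drawn if None).
    """
    n_cells = len(cell_stats)
    if z is None:
        rng = rng or random.Random()
        z = rng.gauss(0.0, 1.0)
    c_eps = math.sqrt(-2.0 * max(t_grid) * math.log(eps))
    h_max = 0.0
    for t in t_grid:
        d_t = sum(1 for stat in cell_stats if stat < t)   # |D_t|
        b = t * z + c_eps
        # Positive root of (1 - t) x^2 + b x - (|G| - |D_t|) = 0.
        root = (-b + math.sqrt(b * b + 4.0 * (1.0 - t) * (n_cells - d_t))) \
            / (2.0 * (1.0 - t))
        h_max = max(h_max, root * root)                    # H_i(C_eps)
    return n_cells - h_max


if __name__ == "__main__":
    stats = [0.1] * 60 + [0.9] * 40
    print(estimate_active_cells(stats, [0.5], math.exp(-1.0), z=0.0))
```

In the toy call, the noise-free relation $|D_t| = (|G| - J)t + J$ would give $J = 20$,
and the returned value is larger, as expected from an upper estimate.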


5.3. Justification of A1 and A2. We check that conditions A1 and A2 are
satisfied for the two situations that we consider in this paper. In the example
from section 3, we can take (using the same notation and introducing the
c.d.f. of a chi-square distribution)





(Recall that $P_g$ is the orthogonal projection on the space generated by
$X_g = (X_v, v \in g)$ and 1. We also let $\hat\sigma_Y^2$ be the empirical variance of $Y$.)

Note that the conditional distribution of $T_g$ given $Y = y$ is always uniform
over $[0,1]$ and therefore does not depend on $y$, which proves that $T_g$
and $Y$ are independent. Similarly, taking $g_1 \neq g_2 \in G_0$, $T_{g_1}$ and $T_{g_2}$ are
conditionally independent given $Y$ (because $X_{g_1}$ and $X_{g_2}$ are independent).
But $T_{g_1}$ and $T_{g_2}$ being conditionally independent given $Y$ and each of them
independent of $Y$ implies that the three variables are mutually independent.

The same argument can be applied to the non-parametric case, when
(now using notation from that section) one assumes that scores are such
that $\rho_g(U) = \rho_g(Y, X_g)$, and uses, to simplify the discussion, the statistic

\[ T_g = \mu\big(\xi : \rho_g(Y, X_g) \le \rho_g(\xi \odot (Y, X_g))\big), \]

assuming, in addition, the following. If we denote by $\mathcal X_g$ the space where
the random variable $X_g$ takes its values, there exists a group $\mathfrak S'$, a group
isomorphism $\phi$ between $\mathfrak S$ and $\mathfrak S'$ and a group action of $\mathfrak S'$ on $\mathcal X_g$ that we
will denote by $\odot'$ satisfying the two following conditions:

• The distribution of $X_g$ is invariant under the action $\odot'$.

• $\rho_g(\xi \odot (Y, X_g)) = \rho_g(Y, \phi(\xi) \odot' X_g)$.

For example, for permutation tests, the group $\mathfrak S'$ is simply the group of
permutations $\mathfrak S$ itself. The isomorphism $\phi$ is the inverse map $\xi \mapsto \xi^{-1}$. The
group action $\odot'$ is just the permutation of the observations. Finally, $\rho_g$ can
be any score that is symmetric with respect to the observations.

Assuming these conditions, one can immediately apply Lemma 4.1 to conclude.

6. Discussion. Given a partition of the space of hypotheses, the basic
assumption which allows the coarse-to-fine multiple testing algorithm to
obtain greater power than the Bonferroni-Holm approach at the same FWER
level is that the distribution of the numbers of active hypotheses across the
cells of the partition is non-uniform. The gap in performance is then roughly
proportional to the degree of skewness. The test derived for the parametric
model can be seen as a generalization to coarse-to-fine testing of the F-test
for determining whether a set of coefficients is zero in a regression model;
the testing procedure derived for the non-parametric case is a generalization
of permutation tests to multi-level multiple testing.

This scenario was motivated by the situation encountered in genome-wide
association studies, where the hypotheses are associated with genetic
variations (e.g., SNPs), each having a location along the genome, and the cells
are associated with genes. In principle, our coarse-to-fine procedure will then
detect more active variants to the extent that these variants cluster in genes.
Of course this extent will depend in practice on many factors, including
effect sizes, the representation of the genotype (i.e., the choice of variants
to explore) as well as the phenotype, and complex interactions within the
genotype. It may be very difficult and uncommon to know anything specific
about the expected nature of the combinatorics between genes and variants.
In some sense, “the proof is in the pudding,” in that one can simply try both
the standard and coarse-to-fine approaches and compare the sets of variants
detected. Given tight control of the FWER, everything found is likely to be
real. Indeed, the analytical bounds obtained here make this comparison
possible, at least under the linear model commonly used in GWAS and in a
general non-parametric model under invariance assumptions.

Looking ahead, we have only analyzed the coarse-to-fine approach for the
simplest case of two levels and a true partition, i.e., non-overlapping cells.
The methods for controlling the FWER for both the parametric and
non-parametric cases generalize naturally to multiple levels assuming nested
partitions. The analytical challenge is to generalize the coarse-to-fine approach
to overlapping cells, even for two levels: while our methods for controlling
the FWER remain valid, they are likely to become overly conservative if cells
overlap. This case is of particular interest in applications, where genes are
grouped into overlapping “pathways.” For example, in “systems biology,”
cellular phenotypes, especially complex diseases such as cancer, are studied
in the context of these pathways, and mutated genes and other abnormalities
are in fact known to cluster in pathways; indeed, this is the justification for
a pathway-based analysis. Hence the clustering properties may be stronger
for variants or genes in pathways than for variants in genes.